An Empirical Investigation of the Relation Between Discourse Structure and Co-Reference

Dan Cristea
Department of Computer Science, University "A.I. Cuza", Iaşi, România
dcristea@infoiasi.ro

Nancy Ide
Department of Computer Science, Vassar College, Poughkeepsie, NY, USA
ide@cs.vassar.edu

Daniel Marcu
Information Sciences Institute and Department of Computer Science, University of Southern California, Los Angeles, CA, USA
marcu@isi.edu

Valentin Tablan
Department of Computer Science, University of Sheffield, United Kingdom
v.tablan@sheffield.ac.uk

Abstract

We compare the potential of two classes of linear and hierarchical models of discourse to determine co-reference links and resolve anaphors. The comparison uses a corpus of thirty texts, which were manually annotated for co-reference and discourse structure.

1 Introduction

Most current anaphora resolution systems implement a pipeline architecture with three modules (Lappin and Leass, 1994; Mitkov, 1997; Kameyama, 1997).

1. A COLLECT module determines a list of potential antecedents (LPA) for each anaphor (pronoun, definite noun, proper name, etc.) that have the potential to resolve it.

2. A FILTER module eliminates referees incompatible with the anaphor from the LPA.

3. A PREFERENCE module determines the most likely antecedent on the basis of an ordering policy.

In most cases, the COLLECT module determines an LPA by enumerating all antecedents in a window of text that precedes the anaphor under scrutiny (Hobbs, 1978; Lappin and Leass, 1994; Mitkov, 1997; Kameyama, 1997; Ge et al., 1998). This window can be as small as two or three sentences or as large as the entire preceding text. The FILTER module usually imposes semantic constraints by requiring that the anaphor and potential antecedents have the same number and gender, that selectional restrictions are obeyed, etc. The PREFERENCE module imposes preferences on potential antecedents on the basis of their grammatical roles, parallelism, frequency, proximity, etc. In some cases, anaphora resolution systems implement these modules explicitly (Hobbs, 1978; Lappin and Leass, 1994; Mitkov, 1997; Kameyama, 1997). In other cases, these modules are integrated by means of statistical (Ge et al., 1998) or uncertainty reasoning techniques (Mitkov, 1997).

The fact that current anaphora resolution systems rely exclusively on the linear nature of texts in order to determine the LPA of an anaphor seems odd, given that several studies have claimed that there is a strong relation between discourse structure and reference (Sidner, 1981; Grosz and Sidner, 1986; Grosz et al., 1995; Fox, 1987; Vonk et al., 1992; Azzam et al., 1998; Hitzeman and Poesio, 1998). These studies claim, on the one hand, that the use of referents in naturally occurring texts imposes constraints on the interpretation of discourse; and, on the other, that the structure of discourse constrains the LPAs to which anaphors can be resolved. The oddness of the situation can be explained by the fact that both groups seem prima facie to be right. Empirical studies that employ linear techniques for determining the LPAs of anaphors report recall and precision anaphora resolution results in the range of 80% (Lappin and Leass, 1994; Ge et al., 1998). Empirical experiments that investigated the relation between discourse structure and reference also claim that by exploiting the structure of discourse one has the potential of determining correct co-referential links for more than 80% of the referential expressions (Fox, 1987; Cristea et al., 1998), although to date, no discourse-based anaphora resolution system has been implemented. Since no direct comparison of these two classes of approaches has been made, it is difficult to determine which group is right, and which method is best.

In this paper, we attempt to fill this gap by empirically comparing the potential of linear and hierarchical models of discourse to correctly establish co-referential links in texts, and hence, their potential to correctly resolve
anaphors. Since it is likely that both linear- and discourse-based anaphora resolution systems can implement similar FILTER and PREFERENCE strategies, we focus here only on the strategies that can be used to COLLECT lists of potential antecedents. Specifically, we focus on determining whether discourse theories can help an anaphora resolution system determine LPAs that are "better" than the LPAs that can be computed from a linear interpretation of texts. Section 2 outlines the theoretical assumptions of our empirical investigation. Section 3 describes our experiment. We conclude with a discussion of the results.

(Author footnote: On leave from the Faculty of Computer Science, University "Al. I. Cuza" of Iaşi.)

2 Background

2.1 Assumptions

Our approach is based on the following assumptions:

1. For each anaphor in a text, an anaphora resolution system must produce an LPA that contains a referent to which the anaphor can be resolved. The size of this LPA varies from system to system, depending on the theory a system implements.

2. The smaller the LPA (while retaining a correct antecedent), the less likely that errors in the FILTER and PREFERENCE modules will affect the ability of a system to select the appropriate referent.

3. Theory A is better than theory B for the task of reference resolution if theory A produces LPAs that contain more antecedents to which anaphors can be correctly resolved than theory B, and if the LPAs produced by theory A are smaller than those produced by theory B. For example, if for a given anaphor, theory A produces an LPA that contains a referee to which the anaphor can be resolved, while theory B produces an LPA that does not contain such a referee, theory A is better than theory B. Moreover, if for a given anaphor, theory A produces an LPA with two referees and theory B produces an LPA with seven referees (each LPA containing a referee to which the anaphor can be resolved), theory A is considered better than theory B because it has a higher probability of solving that anaphor correctly.

We consider two classes of models for determining the LPAs of anaphors in a text:

Linear-k models. This is a class of linear models in which the LPAs include all the references found in the discourse unit under scrutiny and the k discourse units that immediately precede it. Linear-0 models an approach that assumes that all anaphors can be resolved intra-unit; Linear-1 models an approach that corresponds roughly to centering (Grosz et al., 1995). Linear-k is consistent with the assumptions that underlie most current anaphora resolution systems, which look back k units in order to resolve an anaphor.

Discourse-VT-k models. In this class of models, LPAs include all the referential expressions found in the discourse unit under scrutiny and the k discourse units that hierarchically precede it. The units that hierarchically precede a given unit are determined according to Veins Theory (VT) (Cristea et al., 1998), which is described briefly below.

2.2 Veins Theory

VT extends and formalizes the relation between discourse structure and reference proposed by Fox (1987). It identifies "veins", i.e., chains of elementary discourse units, over discourse structure trees that are built according to the requirements put forth in Rhetorical Structure Theory (RST) (Mann and Thompson, 1988).

One of the conjectures of VT is that the vein expression of an elementary discourse unit provides a coherent "abstract" of the discourse fragment that contains that unit. Because this fragment is internally coherent, most of the anaphors and referential expressions (REs) in a unit must be resolved to referees that occur in the text subsumed by the units in the vein. This conjecture is consistent with Fox's view (1987) that the units that contain referees to which anaphors can be resolved are determined by the nuclearity of the discourse units that precede the anaphors and the overall structure of discourse. According to VT, REs of both satellites and nuclei can access referees of hierarchically preceding nucleus nodes. REs of nuclei can mainly access referees of preceding nuclei nodes and of directly subordinated, preceding satellite nodes. And the interposition of a nucleus after a satellite blocks the accessibility of the satellite for all nodes that are lower in the corresponding discourse structure (see (Cristea et al., 1998) for a full definition).

Hence, the fundamental intuition underlying VT is that the RST-specific distinction between nuclei and satellites constrains the range of referents to which anaphors can be resolved; in other words, the nucleus-satellite distinction induces for each anaphor (and each referential expression) a Domain of Referential Accessibility (DRA). For each anaphor a in a discourse unit u, VT hypothesizes that a can be resolved by examining referential expressions that were used in a subset of the discourse units that precede u; this subset is called the DRA of u. For any elementary unit u in a text, the corresponding DRA is computed automatically from the rhetorical representation of that text in two steps:

1. Heads for each node are computed bottom-up over the rhetorical representation tree. Heads of elementary discourse units are the units themselves. Heads of internal nodes, i.e., discourse spans, are computed by taking the union of the heads of the immediate child nodes that are nuclei. For example, for the text in Figure 1, whose rhetorical structure is shown in Figure 2, the head of span [...] is unit 5 because the head of the immediate nucleus, the elementary unit 5, is 5. However, the head of span [...] is the list 6,7 because both immediate children are nuclei of a multinuclear relation.

2. Using the results of step 1, vein expressions are computed top-down for each node in the tree. The vein of the root is its head. Veins of child nodes are computed recursively according to the rules described by Cristea et al. (1998). The DRA of a unit u is given by the units that precede u in the vein. For example, for the text and RST tree in Figures 1 and 2, the vein expression of unit 3, which contains units 1 and 3, suggests that anaphors from unit 3 should be resolved only to referential expressions in units 1 and 3. Because unit 2 is a satellite to unit 1, it is considered to be "blocked" to referential links from unit 3. In contrast, the DRA of unit 9, consisting of units 1, 8, and 9, reflects the intuition that anaphors from unit 9 can be resolved only to referential expressions from unit 1, which is the most important unit in span [...], and to unit 8, a satellite that immediately precedes unit 9. Figure 2 shows the heads and veins of all internal nodes in the rhetorical representation.

Figure 1: An example of text and its elementary units. The referential expressions surrounded by boxes and ellipses correspond to two distinct co-referential equivalence classes. Referential expressions surrounded by boxes refer to "Mr. Casey"; those surrounded by ellipses refer to "Genetic Therapy Inc."

Figure 2: The RST analysis of the text in Figure 1. The tree is represented using the conventions proposed by Mann and Thompson (1988).

2.3 Comparing models

The premise underlying our experiment is that there are potentially significant differences in the size of the search space required to resolve referential expressions when using Linear models vs. Discourse-VT models. For example, for the text and the RST tree in Figures 1 and 2, the Discourse-VT model narrows the search space required to resolve the anaphor "the smaller company" in unit 9. According to VT, we look for potential antecedents for "the smaller company" in the DRA of unit 9, which lists units 1, 8, and 9. The antecedent "Genetic Therapy, Inc." appears in unit 1; therefore, using VT we search back 2 units (units 8 and 1) to find a correct antecedent. In contrast, to resolve the same reference using a linear model, four units (units 8, 7, 6, and 5) must be examined before "Genetic Therapy" is found. Assuming that referential links are established as the text is processed, "Genetic Therapy" would be linked back to the pronoun "its" in unit 2, which would in turn be linked to the first occurrence of the antecedent, "Genetic Therapy, Inc.", in unit 1, the antecedent determined directly by using VT.

In general, when hierarchical adjacency is considered, an anaphor may be resolved to a referent that is not the closest in a linear interpretation of a text. Similarly, a referential expression can be linked to a referee that is not the closest in a linear interpretation of a text. However, this does not create problems because we are focusing here only on co-referential relations of identity (see section 3). Since these relations induce equivalence classes over the set of referential expressions in a text, it is sufficient that an anaphor or referential expression is resolved to any of the members of the relevant equivalence class. For example, according to VT, the referential expression "Mr. Casey" in unit 5 in Figure 1 can be linked directly only to the referee "Mr. Casey" in unit 1, because the DRA of unit 5 is 1,5. By considering the co-referential links of the REs in the other units, the full equivalence class can be determined. This is consistent with the distinction between "direct" and "indirect" references discussed by Cristea et al. (1998).

3 The Experiment

3.1 Materials

We used thirty newspaper texts whose lengths varied widely; the mean length is 408 words and the standard deviation is 376. The texts were annotated manually for co-reference relations of identity (Hirschman and Chinchor, 1997). The co-reference relations define equivalence classes on the set of all marked referents in a text. The texts were also manually annotated by Marcu et al. (1999) with discourse structures built in the style of Mann and Thompson (1988). Each discourse analysis yielded an average of 52 elementary discourse units. See (Hirschman and Chinchor, 1997) and (Marcu et al., 1999) for details of the annotation processes.

3.2 Comparing potential to establish co-referential links

3.2.1 Method

The annotations for co-reference relations and rhetorical structure trees for the thirty texts were fused, yielding representations that reflect not only the discourse structure, but also the co-reference equivalence classes specific to each text. Based on this information, we evaluated the potential of each of the two classes of models discussed in section 2 (Linear-k and Discourse-VT-k) to correctly establish co-referential links as follows: For each model, each k, and each marked referential expression a, we determined whether or not the corresponding LPA (defined over k elementary units) contained a referee from the same equivalence class. For example, for the Linear-2 model and referential expression "the smaller company" in unit 9, we estimated whether a co-referential link could be established between "the smaller company" and another referential expression in units 7, 8, or 9. For the Discourse-VT-2 model and the same referential expression, we estimated whether a co-referential link could be established between "the smaller company" and another referential expression in units 1, 8, or 9, which correspond to the DRA of unit 9.

To enable a fair comparison of the two models, when k is larger than the size of the DRA of a given unit, we extend that DRA using the closest units that precede the unit under scrutiny and are not already in the DRA. Hence, for the Linear-3 model and the referential expression "the smaller company" in unit 9, we estimate whether a co-referential link can be established between "the smaller company" and another referential expression in units 9, 8, 7, or 6. For the Discourse-VT-3 model and the same referential expression, we estimate whether a co-referential link can be established between "the smaller company" and another referential expression in units 9, 8, 1, or 7, which correspond to the DRA of unit 9 (units 9, 8, and 1) and to unit 7, the closest unit preceding unit 9 that is not in its DRA.

For the Discourse-VT-k models, we assume that the Extended DRA (EDRA) of size k of a unit u, EDRA_k(u), is given by u and the first k units of a sequence that lists, in reverse order, the units of the DRA of u that precede u, followed by the closest units that precede u but are not in its DRA. For example, for the text in Figure 1, the following relations hold:

EDRA_0(9) = 9
EDRA_1(9) = 9, 8
EDRA_2(9) = 9, 8, 1
EDRA_3(9) = 9, 8, 1, 7
EDRA_4(9) = 9, 8, 1, 7, 6

For Linear-k models, EDRA_k(u) is given by u and the k units that immediately precede u.

The potential p(M, a, EDRA_k(u)) of a model M to determine correct co-referential links with respect to a referential expression a in unit u, given a corresponding EDRA of size k, is assigned the value 1 if the EDRA contains a co-referent from the same equivalence class as a. Otherwise, p(M, a, EDRA_k(u)) is assigned the value 0. The potential p(M, C, k) of a model M to determine correct co-referential links for all referential expressions in a corpus of texts C, using EDRAs of size k, is computed as the sum of the potentials p(M, a, EDRA_k(u)) of all referential expressions a in C. This potential is normalized to a value between 0 and 1 by dividing p(M, C, k) by the number of referential expressions in the corpus that have an antecedent.

By examining the potential of each model to correctly determine co-referential expressions for each k, it is possible to determine the degree to which an implementation of a given approach can contribute to the overall efficiency of anaphora resolution systems. That is, if a given model has the potential to correctly determine a significant percentage of co-referential expressions with small DRAs, an anaphora resolution system implementing that model will have to consider fewer options overall. Hence, the probability of error is reduced.

3.2.2 Results

The graph in Figure 3 shows the potentials of the Linear-k and Discourse-VT-k models to correctly determine co-referential links for each k from 1 to 20. The graph in Figure 4 represents the same potentials but focuses only on ks in the interval [...]. As these two graphs show, the potentials increase monotonically with k, the VT-k models always doing better than the Linear-k models. Eventually, for large ks, the potential performance of the two models converges to 100%.

Figure 3: The potential of Linear-k and Discourse-VT-k models to determine correct co-referential links.

Figure 4: The potential of Linear-k and Discourse-VT-k models to determine correct co-referential links.

The graphs in Figures 3 and 4 also suggest resolution strategies for implemented systems. For example, the graphs suggest that by choosing to work with EDRAs of size 7, a discourse-based system has the potential of resolving more than 90% of the co-referential links in a text correctly. To achieve the same potential, a linear-based system needs to look back 8 units. If a system does not look back at all and attempts to resolve co-referential links only within the unit under scrutiny, it has the potential to correctly resolve about 40% of the co-referential links.

To provide a clearer idea of how the two models differ, Figure 5 shows, for each k, the value of the Discourse-VT-k potentials divided by the value of the Linear-k potentials. For k = 0, the potentials of both models are equal because both use only the unit in focus in order to determine co-referential links. For k = 1, the Discourse-VT-1 model is about 7% better than the Linear-1 model. As the value of k increases, the ratio Discourse-VT-k/Linear-k converges to 1.

Figure 5: A direct comparison of Discourse-VT-k and Linear-k potentials to correctly determine co-referential links.

In Figures 6 and 7, we display the number of exceptions, i.e., co-referential links that Discourse-VT-k and Linear-k models cannot determine correctly. As one can see, over the whole corpus, for each k, the Discourse-VT-k models have the potential to determine correctly about 100 more co-referential links than the Linear-k models. As k increases, the performance of the two models converges.

Figure 6: The number of co-referential links that cannot be correctly determined by Discourse-VT-k and Linear-k models.

Figure 7: The number of co-referential links that cannot be correctly determined by Discourse-VT-k and Linear-k models.

3.2.3 Statistical significance

In order to assess the statistical significance of the difference between the potentials of the two models to establish correct co-referential links, we carried out a Paired-Samples T Test for each k. In general, a Paired-Samples T Test checks whether the mean of casewise differences between two variables differs from 0. For each text in the corpus and each k, we determined the potentials of both VT-k and Linear-k models to establish correct co-referential links in that text. For ks smaller than 4, the difference in potentials was statistically significant. For example, for [...]. For values of k larger than or equal to 4, the difference was no longer significant. These results are consistent with the graphs shown in Figures 3 to 7, which all show that the potentials of Discourse-VT-k and Linear-k models converge to the same value as the value of k increases.

3.3 Comparing the effort required to establish co-referential links

3.3.1 Method

The method described in section 3.2.1 estimates the potential of Linear-k and Discourse-VT-k models to determine correct co-referential links by treating EDRAs as sets. However, from a computational perspective (and presumably, from a psycholinguistic perspective as well) it also makes sense to compare the effort required by the two classes of models to establish correct co-referential links. We estimate this effort using a very simple metric that assumes that the closer an antecedent is to a corresponding referential expression in the EDRA, the better. Hence, in estimating the effort to establish a co-referential link, we treat EDRAs as ordered lists. For example, using the Linear-9 model, to determine the correct antecedent of the referential expression "the smaller company" in unit 9 of Figure 1, it is necessary to search back through 4 units (to unit 5, which contains the referent "Genetic Therapy"). Had unit 5 been "Mr. Casey succeeds M. James Barrett, 50", we would have had to go back 8 units (to unit 1) in order to correctly resolve the RE "the smaller company". In contrast, in the Discourse-VT-9 model, we go back only 2 units because unit 1 is two units away from unit 9 in the corresponding EDRA.

We consider that the effort e(M, a, EDRA_k(u)) of a model M to determine correct co-referential links with respect to one referential expression a in unit u, given a corresponding EDRA of size k, is given by the number of units between u and the first unit in EDRA_k(u) that contains a co-referential expression of a.

The effort e(M, C, k) of a model M to determine correct co-referential links for all referential expressions in a corpus of texts C using EDRAs of size k was computed as the sum of the efforts e(M, a, EDRA_k(u)) of all referential expressions a in C.

3.3.2 Results

Figure 8 shows the Discourse-VT-k and Linear-k efforts computed over all referential expressions in the corpus and all ks. It is possible, for a given referent a and a given k, that no co-referential link exists in the units of the corresponding EDRA. In this case, we consider that the effort is equal to k. As a consequence, for small ks the effort required to establish co-referential links is similar for both theories, because both can establish only a limited number of links. However, as k increases, the effort computed over the entire corpus diverges dramatically: using the Discourse-VT model, the search space for co-referential links is reduced by about 800 units for a corpus containing roughly 1200 referential expressions.

Figure 8: The effort required by Linear-k and Discourse-VT-k models to determine correct co-referential links.

3.3.3 Statistical significance

A Paired-Samples T Test was performed for each k. For each text in the corpus and each k, we determined the effort of both VT-k and Linear-k models to establish correct co-referential links in that text. For all ks the difference in effort was statistically significant. For example, for [...], we obtained the values [...]. These results are intuitive: because EDRAs are treated as ordered lists and not as sets, the effect of the discourse structure on establishing correct co-referential links is not diminished as k increases.

4 Conclusion

We analyzed empirically the potentials of discourse and linear models of text to determine co-referential links. Our analysis suggests that by exploiting the hierarchical structure of texts, one can increase the potential of natural language systems to correctly determine co-referential links, which is a requirement for correctly resolving anaphors. If one treats all discourse units in the preceding discourse equally, the increase is statistically significant only when a discourse-based co-reference system looks back at most four discourse units in order to establish co-referential links. However, if one assumes that proximity plays an important role in establishing co-referential links and that referential expressions are more likely to be linked to referees that were used recently in discourse, the increase is statistically significant no matter how many units a discourse-based co-reference system looks back in order to establish co-referential links.

Acknowledgements

We are grateful to Lynette Hirschman and Nancy Chinchor for making available their corpora of co-reference annotations. We are also grateful to Graeme Hirst for comments and feedback on a previous draft of this paper.

References

Saliha Azzam, Kevin Humphreys, and Robert Gaizauskas. 1998. Evaluating a focus-based approach to anaphora resolution. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics (COLING/ACL'98), pages 74–78, Montreal, Canada, August 10–14.

Dan Cristea, Nancy Ide, and Laurent Romary. 1998. Veins theory: A model of global discourse cohesion and coherence. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics (COLING/ACL'98), pages 281–285, Montreal, Canada, August.

Barbara Fox. 1987. Discourse Structure and Anaphora. Cambridge Studies in Linguistics; 48. Cambridge University Press.

Niyu Ge, John Hale, and Eugene Charniak. 1998. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 161–170, Montreal, Canada, August 15-16.

Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204, July–September.

Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–226, June.

Lynette Hirschman and Nancy Chinchor. 1997. MUC-7 Coreference Task Definition, July 13.

Janet Hitzeman and Massimo Poesio. 1998. Long distance pronominalization and global focus. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics (COLING/ACL'98), pages 550–556, Montreal, Canada, August.

Jerry H. Hobbs. 1978. Resolving pronoun references. Lingua, 44:311–338.

Megumi Kameyama. 1997. Recognizing referential links: An information extraction perspective. In Proceedings of the ACL/EACL'97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution, pages 46–53.

Shalom Lappin and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535–561.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Daniel Marcu, Estibaliz Amorrortu, and Magdalena Romera. 1999. Experiments in constructing a corpus of discourse trees. In Proceedings of the ACL'99 Workshop on Standards and Tools for Discourse Tagging, pages 48–57, University of Maryland, June 22.

Ruslan Mitkov. 1997. Factors in anaphora resolution: They are not the only things that matter. A case study based on two different approaches. In Proceedings of the ACL/EACL'97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution, pages 14–21.

Candace L. Sidner. 1981. Focusing for interpretation of pronouns. Computational Linguistics, 7(4):217–231, October–December.

Wietske Vonk, Lettica G.M.M. Hustinx, and Wim H.G. Simons. 1992. The use of referential expressions in structuring discourse. Language and Cognitive Processes, 7(3,4):301–333.
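
Illustrative sketch. The EDRA construction of section 3.2.1 and the potential and effort measures of sections 3.2.1 and 3.3.1 can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used in the experiment: units are assumed to be numbered consecutively from 1, referential expressions are given in text order as (unit, equivalence-class label) pairs, and the DRA of each unit is supplied by hand rather than computed from an RST tree (the vein rules are defined in Cristea et al. (1998)). The names `edra`, `antecedent_units`, `potential`, `effort`, and `normalized_potential` are hypothetical, and the DRA assumed for unit 2 in the toy data is not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple

Mention = Tuple[int, str]  # (unit number, equivalence-class label), in text order


def edra(unit: int, k: int, dra: Optional[Set[int]] = None) -> List[int]:
    """Ordered EDRA of size k: the unit itself plus up to k earlier units.

    dra=None gives the Linear-k window. Otherwise the preceding DRA units
    come first (closest first), then the closest preceding non-DRA units.
    """
    preceding = list(range(unit - 1, 0, -1))  # closest preceding unit first
    if dra is not None:
        inside = [u for u in preceding if u in dra]
        outside = [u for u in preceding if u not in dra]
        preceding = inside + outside
    return [unit] + preceding[:k]


def antecedent_units(mentions: List[Mention], i: int) -> Set[int]:
    """Units holding an earlier mention from the same equivalence class."""
    _, label = mentions[i]
    return {u for (u, c) in mentions[:i] if c == label}


def potential(mentions: List[Mention], i: int, k: int,
              dra: Optional[Set[int]] = None) -> int:
    """1 if the EDRA, treated as a set, contains a co-referent; else 0."""
    targets = antecedent_units(mentions, i)
    return int(any(u in targets for u in edra(mentions[i][0], k, dra)))


def effort(mentions: List[Mention], i: int, k: int,
           dra: Optional[Set[int]] = None) -> int:
    """Position in the ordered EDRA of the first unit with a co-referent.

    Returns k when no unit of the EDRA contains a co-referent.
    """
    targets = antecedent_units(mentions, i)
    for distance, u in enumerate(edra(mentions[i][0], k, dra)):
        if u in targets:
            return distance
    return k


def normalized_potential(mentions: List[Mention], k: int,
                         dras: Optional[Dict[int, Set[int]]] = None) -> float:
    """Summed potential divided by the number of mentions with an antecedent."""
    resolvable = [i for i in range(len(mentions)) if antecedent_units(mentions, i)]
    if not resolvable:
        return 0.0
    hits = sum(
        potential(mentions, i, k, None if dras is None else dras[mentions[i][0]])
        for i in resolvable
    )
    return hits / len(resolvable)


# Toy data loosely following the "Genetic Therapy" chain of Figure 1.
mentions = [(1, "GT"), (2, "GT"), (5, "GT"), (9, "GT")]
# DRAs of units 5 and 9 are stated in the text; the one for unit 2 is assumed.
dras = {2: {1, 2}, 5: {1, 5}, 9: {1, 8, 9}}

print(edra(9, 3, dras[9]), edra(9, 3))                      # VT-3 vs Linear-3 window
print(effort(mentions, 3, 9, dras[9]), effort(mentions, 3, 9))  # VT-9 vs Linear-9 effort
```

On this toy chain the sketch reproduces the worked examples from the text: the Discourse-VT-3 window for unit 9 is 9, 8, 1, 7, and the effort for "the smaller company" is 2 units under Discourse-VT-9 against 4 under Linear-9. Keeping the EDRA as an ordered list lets the same function serve both measures, since the potential ignores the order and the effort depends on it.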